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Abstract. We present a numerical algorithm for finding real non-negative 
solutions to polynomial equations. Our methods are based on the expectation 
maximization and iterative proportional fitting algorithms, which are used in 
statistics to find maximum likelihood parameters for certain classes of statisti- 
cal models. Since our algorithm works by iteratively improving an approximate 
solution, we find approximate solutions in the cases when there are no exact 
solutions, such as overconstrained systems. 



We present an iterative numerical method for finding non-negative solutions and 
approximate solutions to systems of polynomial equations. We require two assump- 
tions about our system of equations. First, for each equation, all the coefficients 
other than the constant term must be non-negative. Second, there is a technical 
assumption on the exponents, described at the beginning of Section [H which, for 
example, is satisfied if all non-constant terms have the same total degree. In Sec- 
tion 121 there is a discussion of the range of possible systems which can arise under 
these hypotheses. 

Because of the assumption on signs, we can write our system of equations as, 

(1) ^ aiax" = 6i for i = 1, . . . , ^, 

where the coefficients aia are non-negative and the bi are positive, and S C M"q is a 
finite set of possibly non-integer multi-indices. Our algorithm works by iteratively 
decreasing the generalized Kullback-Leibler divergence of the left-hand side and 
right-hand side of ([T]) . The generalized Kullback-Leibler divergence of two positive 
vectors a and b is defined to be 

(2) i5(a||6):^^(^5aog(^^) -6. + : 

The standard Kullback-Leibler consists only of the first term and is defined only 
for probability distributions, i.e. the sum of each vector is 1. The last two terms 
are necessary so that the generalized divergence has, for arbitrary positive vectors a 
and b, the property of being non-negative and zero exactly when a and b are equal 
(Proposition [3]). 

Our algorithm converges to local minima of the Kullback-Leibler divergence, in- 
cluding exact solutions to the system ([T|). In order to find multiple local minima, 
we can repeat the algorithm for randomly chosen starting points. For finding ap- 
proximate solutions, this may be sufhcient. However, there are no guarantees of 
completeness for the exact solutions obtained in this way. Nonetheless, we hope 
that in certain situations, our algorithm will find applications both for finding exact 
and approximate solutions. 
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Lee and Seung applied the EM algorithm to the problem of non-negative ma- 
trix factorization |^ . They introduced the generalized KuUback-Leibler divergence 
in and used it to find approximate non-negative matrix factorizations. Since 
the product of two matrices can be expressed by polynomials in the entries of the 
matrices, matrix- factorization is a special case of the equations in ([1]). 

For finding exact solutions to arbitrary systems of polynomials, there are a vari- 
ety of approaches which find all complex or all real solutions. Homotopy continua- 
tion methods find all complex roots of a system of equations 8 . Even to find only 
positive roots, these two methods finds all complex or all real solutions, respectively. 
Lasserre, Laurent, and Rostalski have applied semidefinite programming to find all 
real solutions to a system of equations and a slight modificiation of their algorithm 
will find all positive real solutions [4j[5]. Nonetheless, neither of these methods has 
any notion of approximate solutions. 

For directly finding only positive real solutions. Bates and Sottile have proposed 
an algorithm based on fewnomials bounds on the number of real solutions [I]. 
However, their method is only effective when the number of monomials (the set S 
in our notation) is only slightly more than the number of variables. Our method 
only makes weak assumptions on the set of monomials, but stronger assumptions 
on the coefficients. 

Our inspiration comes from tools for maximum likelihood estimation in statistics. 
Parameters which maximize the likelihood are exactly the parameters such that the 
model probabilities are closest to the empirical probabilities, in the sense of mim- 
imizing KuUback-Leibler divergence. Expectation-Maximization Sec. 1.3] and 
Iterative Proporitional Fitting fS* are well-known iterative methods for maximum 
likelihood estimation. We re-interpret these algorithms as methods for approxi- 
mating solutions to polynomial equations, in which case their applicability can be 
somewhat generalized. 

The impetus behind the work in this paper was the need to find approximate 
positive solutions to systems of bilinear equations in [2] . In this case the variables 
represented expression levels, which only made sense as positive parameters. More- 
over, in order to accomodate noise in the data, there were more equations than 
variables, so it was necessary to find approximate solutions. Thus, the algorithm 
described in this paper was the most appropriate tool. Here, we generalize beyond 
bilinear equations and present proofs. 

An implementation of our algorithm in the C programming language is freely 
available at http : / /math . berkeley . edu/~dustin/pos/' 

In Section [1] we describe the algorithm and the connection to maximum like- 
lihood estimation. In Section [21 we prove that the necessary convergence for our 
algorithm. Finally, in Section [3l we show that even with our restrictions on the 
form of the equations, there can be exponentially positive real solutions. 

1. Algorithm 

We make the assumption that we have an s x n non-negative matrix g, with no 
column identically zero, and positive real numbers dj for 1 < j < s such that for 
each a G S and each j < s, X)"^! 9ji'^i is either or dj. For example, if all the 
monomials a;" have the same total degree di, we can take s = 1 and gn = 1 for 
all i. The other case of particular interest is multilinear systems of equations, in 
which each ai is at most one. In this case the variables can be partitioned into sets 
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such that the equations are Unear in each set of variables, so we can take dj = 1 
for all j. Note that because dj is in the denominator in (|4]), convergence is fastest 
when the dj are small, such as in the multilinear case. We also note that, for an 
arbitrary set of exponents S, there may not exist such a g. 

The algorithm begins with a randomly chosen starting vector and iteratively 
improves it through two nested iterations: 

• Initialize x with n randomly chosen positive real numbers. 

• Loop until the vector x stabilizes: 

• For all a G S, compute 

• Loop until the vector x stablizes: 

• Loop for j from 1 to s: 

• Simultaneously update all entries of x: 

(4) Xi ^ Xi ( "'-^J'^" ^ where = ma- 

Because there is no subtraction, it is clear that the entries of x remain positive 
throughout this algorithm. 

Our method is inspired by interpreting the equations in ([l} as a maxmimum 
likelihood problem for a statistical model and applying the well-known methods of 
Expectation-Maximization (EM) and Iterative Proportional Fitting (IPF). Here, 
we assume that all the monomials have the same total degree. Our statistical 
model is that a hidden process generates an integer i < t and an exponent vector 
a with joint probability aiax". The vector x contains n positive parameters for 
the model, restricted such that the total probability ^ • Qirj^CtX IS 1. The empirical 
data consists of repeated observations of the integer i, but not the exponent a, and 
hi is the proportion of observations of i. In this situation, the vector x which mini- 
mizes the divergence of ([1]) is the maximum likelihood parameters for the empirical 
distribution bi. The inner loop of the algorithm consists of using IPF to solve the 
log-linear hidden model and the outer loop consists of using EM to estimate the 
distribution on the hidden states. 



2. Proof of convergence 
In this section we prove our main theorem: 
Theorem 1. The Kullhack-Leibler divergence 

(5) ^i?('^a,„x"||5,y 

i=l \a£S I 

is weakly decreasing during the algorithm in Section [7J Moreover, assuming that 
the set S contains a multiple of each unit vector Ci, i.e. some power of each Xi 
appears in the system of equations, then the vector x coverges to a critical point of 
the function ^ or the boundary of the positive orthant. 

Remark 2. The condition on S is necessary to ensure that the vector x remains 
hounded during the algorithm. 
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We begin by establishing several basic properties of the generalized KuUback- 
Leibler divergence in Proposition [3] and Lemmas |4] and El Note that D {a\\b) = 
SilLi ^ (flill^i) Ei-nd thus, we will check these basic properties by reducing to the 
case where a and b are scalars. 

The proof of Theorem[l]itself is divided into two parts, corresponding to the two 
nested iterative loops. The first step is to prove that the updates (jlj in the inner 
loop converge a local minimum of the divergence D {aaX^Wwa)- The second step 
is to show that this implies that the outer loop strictly decreases the divergence 
function ([SJ exact at a critical point. 

Proposition 3. For a and b vectors of positive real numbers, the divergence D {a\\b) 
is non-negative with D {a\\b) = if and only if a — b. 

Proof. It suffices to prove this when a and b are scalars. We let t ~ b/a, and, 
D {a\\b) = felog ^-^ - b + a ^ a{tlogt - t + 1) = a logsds. 

It is clear that log s ds is non- negative and equal to zero if and only if t = 1 . □ 

Lemma 4. Suppose that a and b are vectors of m positive real numbers. Let t be 
any positive real number, and then 



D {ta\\b) = D {a\\b) + (t - 1) ^ - ^ h \ogt 



i=l 1=1 



Proof. Again, we assume that a and b are scalars and then it becomes a straight- 
forward computation. □ 

Lemma 5. If a and b are vectors of m positive real numbers, then we can relate 
their divergence to the divergence of their sums by 

D {a\\b) = D {YZi «dl E" 1 h) + D (^gii^allfc) . 

Proof. We let A = B = X)"=i ^i, and apply LemmaUto the last term: 

D (^^a\\b^ = D {a\\b) + - 1^ A - S log | 

= D (a||6) ~ B\og — + B - A = D {a\\b) - D {A\\B) . 
After rearranging, we get the desired expression. □ 
Lemma 6. The update rule ^ weakly decreases the divergence. If 



then 
(6) 
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Proof. First, since the statement only depends on the jth row of the matrix g, we 
can assume that g is a row vector and we drop j from future subscripts. Second, 
we can assume that d = 1 by replacing gi with gi/d. 

Third, we reduce to the case when gi — 1 for all i. We define a new set of 
exponents a and coefhcients 0,5 by ai = giai and da — aaYi^ii where the product 
is taken over all indices i such that gi = 0. We take i to be a vector indexed 
by those i such that gi ^ 0. Then, under the change of coordinates Xi — x\^^' , 
we have aax'^ = hax'^ and the update rule in ^ is the same for the new system 
with coefficients ha and exponents a. Furthermore, if all entries of a are zero, then 

= 1 for all vectors x and so we can drop a from our exponent set. Therefore, for 
the rest of the proof, we drop the tildes, and assume that ai ~ 1 for all a E S 
and 5i = 1 for all i, in which case g drops out of the equations. 

To prove the desired inequality, we substitute the updated assignment x' into 
the definition of KuUback-Leibler divergence: 



D {aaix')°'\\Wa) = Wa log 



- Wq, + aaix'Y 



. Wa sr^ , Hfi^^^H , , ,.a 

^Walog r-> aiWa log ^ TT-Wa+aaix) 



(7) = D {aaX°-\\wa) - 2^ aiWa log -^^-^ - aax" + aa{x')". 



On the other hand, let C denote the last term of dH), which we can expand as. 



C = ^ -D ( ^ aiUaX^W ^ UtWa j 
i—l \ ct a / 

= > > azWa log „ J - aiWa + a^aaX 

(8) -l^il.^^^'^^Og—^ ~Wa+aaX , 



where the last step follows from the assumption that that J^i Q^i = 1 for all a E S. 
We take the sum of ([7]) over all a G and add it to (|8]) to get. 



(9) J2 ^ MxTWWa) +C^J2^ iaaX"\\Wa) " ^ + E «"(^')"- 
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Finally, we expand the last term of ^ using the definition of x' and apply the 
arithmetic-geometric mean inequality, 



1=1 13 13 

Together with ([5]), this gives the desired inequality. □ 

Proposition 7. A positive vector x is a fixed point of the update rule ^ for all 1 < 
j < s if and only if x is a critical point of the divergence function X]q ^ (o-aX^Wwa). 

Proof. For the update rule to be constant means that the numerator and denomi- 
nator in (|4]) are equal, i.e. 

(10) y^ aigjiOax" = y^ cagjiWa for all i and j. 

a a 

By our assumption on g, for each i, some gji is non-zero, so ([TU| is equivalent to 

(11) aiOax" ~ aiWa for all i. 

a a 

On the other hand, we compute the partial derivative 

*9 V ^ 7-1 / O- I I \ X ^ ^ i 

— 2_^D{aaX \\Wa) ^ 2_^-Wa V aiOa . 

OXi Xi Xi 

a a 

Since each Xi is assumed to be non-zero, it is clear that all partial derivatives being 
zero is equivalent to ([TT|) . □ 

Lemma 8. // we define Wa as in (0), then 



1=1 



(a„(x')" Ike) - E ^ («aa;"|lw„) 



Moreover, a positive vector x is a fixed point if and only if x is a critical point for 
the divergence function. 

Proof. We consider 

biO^ioiX 



(12) ^i?(a,„(a;')"|k«c.) where 
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and apply Lemma [5] in two different ways. First, by applying Lemma [5] to each 
group of ([T2|) with fixed a, we get 



y^D {a,^{x'r\\w,^) ^y^D {a^{x'r\\wo.) + y^D ( ^^7° "'"(^')°ii^^.c. 
rt ^ 



l.CL 



In the last term, the monomials (x')" cancel and so it is a constant independent of 
x' which we denote E. On the other hand, we can apply Lemma [S] to each group 
in with fixed i, 



^i^ia.i'^ ) II 

f,aif}{x')P 

We can combine these equations to get 
(13) 

bia.,a{x'Y 

By Proposition [21 the last term of[T3]is non- negative, and by the definition of Wia, 
it is zero for x' — x. Therefore, any value of x' which decreases the first term 
compared to x will also decrease the left hand side by at least as much, which i 
sthe desired inequality. 

In order to prove the statement about the derivative, we consider the derivative 
of (jl3l) at x' = X. Because the last term is mimimized at x' ~ x, its derivative is 
zero, so 



d 



c'—x i \ a / i 



Y,D{aUx'T\\w^o.) 



x'=x «,a 

By Proposition [7J a positive vector x is a fixed point of the inner loop if and only if 
these partial derivatives on the right are zero for all indices j, which is the definition 
of a critical point. □ 

Proof of Theorem\^ The Kullback-Leibler divergence (a^a;" Hwq) decreases 

at each step of the inner loop by Lemma [6] Thus, by Lemma |8j the divergence 



(14) yD\ya,o.x 




decreases at least as much. However, the divergence ([T^ is non-negative according 
to Proposition [S] Therefore, the magnitude of the decreases in divergence must 
approach zero over the course of the algorithm. By Lemma |6l this means that the 
quantity C in that theorem must approach zero. By Proposition |31 this means that 
the quantities in that divergence approach each other. However, up to a power, 
these are the numerator and denominator of the factor in the update rule Q, so 
the difference between consecutive vectors x approaches zero. 

Thus, we just need to show that x remains bounded. However, since some power 
of each variable Xi occurs in some equation, as Xi gets large, the divergence for that 
equation also gets arbitrarily large. Therefore, each Xi must remain bounded, so 
the vector x must have a limit as the algorithm is iterated. If this limit is in the 
interior of the positive orthant, then it must be a fixed point. By Lemma [S] and 
Proposition [3 this fixed point must be a critical point of the divergence ([5]). □ 
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3. Universality 

Although the restriction on the exponents and especiahy the positivity of the 
coefRcients seem hke strong conditions, such systems can nonetheless be quite com- 
plex. In this section, we investigate the breadth of such equations. 

Proposition 9. For any system of £ real polynomials in n variables, there exists 
a system of £ + 1 equations in n + 1 variables, in the form such that the 

positive solutions (xi, . . . , a;„) to the former system are in hijection with the positive 
solutions (x'l, . . . , x'„^i) of the latter, with x[ — Xi/xn+i- 

Proof. We write our system of equations as X^aeS C'iax'^ = for 1 < i < £, where 
S C N" is an arbitrary finite set of exponents and aia are any real numbers. We 
let d be the maximum degree of any monomial x" for a G S*. We homogenize 
the equations with a new variable Xn+i- Explicitly, define S' C N""*"^ to consist 
of a' = {a,d — ^iOii) for all a in S" and we write aia' — aia- We add a new 
equation with coefficients o^+i^q, = 1 for all a £ S' and = 1. For this system, 
we can clearly take gu = 1 and di = d to satisfy the condition on exponents. Fur- 
thermore, for any positive solution (xi, . . . , x„) to the original system of equations, 

(xi,...,x^+i) with x- = x,/{Y,aX°'Y^'^ and xj^+i = l/( Ea 2^") ^^"^ is a solution 
to the homogenized system of equations. 

Next, we add a multiple of the last equation to each of the others in order to 
make all the coefficients positive. For each 1 < i < £, choose a positive hi > 
— min^ {aiQ, | a £ S'}, and define a'^^ — Oia + bi. By construction, the resulting 
system has all positive coefficients, and since the equations are formed from the 
previous equations by elementary linear transformations, the set of solutions are 
the same. □ 

The practical use of the construction in the proof of Proposition[5]is mixed. The 
first step, of homogenizing to deal with arbitrary sets of exponents, is a straight- 
forward way of guaranteeing the existence of the matrix g. However, for large 
systems, the second step tends to produce an ill-conditioned coefficient matrix. In 
these cases, our algorithm converges very slowly. Nonetheless, Proposition |9] shows 
that in the worst case, systems satisfying our hypotheses can be as complicated as 
arbitrary polynomial systems. 

Proposition 10. There exist bilinear equations in 2m variables with posi- 
tive real solutions. 

Proof. We use a variation on the technique used to prove Proposition |9l 

First, we pick 2m — 2 generic homogeneous linear functions 6i, . . . , 62m-2 on m 
variables. By generic, we mean for any m of the bk, the only simultaneous solution 
of all m linear equations is the trivial one. This genericity implies that any m — 1 
of the bk define a point in P™^i By taking a linear changes of coordinates in each 
set of variables, we can assume that all of these points are positive, i.e. have a 
representative consisting of all positive real numbers. 
Then we consider the system of equations 

(15) bk{xi,...,Xm) •&fe(x„j+i,...,x„) = 0, for 1 < fc < 2m-2 

(16) (Xi -f . . . + X„)(x„+i + . . . + .X2,n) = 1 

(17) xi + . . . + x„i = 1. 
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The equations ()15|) are bihomogeneous and so we can think of their solutions in 
pm-i ^ p™-i. There are exactly i^^Zi) positive real solutions, corresponding to 
the subsets A C [2m — 2] of size m — 1. For any such A, there is a unique, distinct 
solution satisfying bk{xi, . . . ,Xm) — for all fc in ^ and bk{xm+iT ■ ■ ,X2m) = 
for all k not in A. By assumption, for each solution, all the coordinates can be 
chosen to be positive. The last two equations ([T6|) and (|17p dehomogenize the 
system in a way such that there are ( ) positive real solutions. Finally, as in 
the last paragraph of the proof of Proposition El we can add multiples of p6|) to 
the equations (fT5|l in order to make all the coefficients positive. □ 
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